Improving integration quality for heterogeneous data sources

Author

  • Evgeniya Altareva
Abstract

This work addresses the problem of estimating integration quality (IQ) when integrating heterogeneous semi-structured data sources. IQ estimation plays an important role in the integration of such sources because correspondences and dependencies within and across the sources are not completely known and the schema or semantics might be missing, which leads to results of unpredictable trustworthiness. We therefore review existing methods for analyzing such data sources and investigate a possible scenario of the integration process. We analyze the problem of uncertainty in the integration process, introducing examples that demonstrate the present inability to account for the combined uncertainties affecting integration quality, and we introduce a classification of the types of uncertainty. To account for the uncertainties, we suggest using Latent Class Analysis (LCA), a statistical method belonging to the family of latent variable models. This method makes it possible to analyze the influence of latent factors on a set of data. In the integration task, the latent factor is the membership of an object in a real-world class, and the role of LCA is to interpret the correlation among the sources in discovering identical objects as a manifestation of that common factor. We build a statistical model of the integration task, i.e., we draw correspondences between the terms of statistics and the terms of integration. LCA requires at least three data sources; when only two sources are integrated, the integrated database itself can serve as the missing third source. The result of the analysis is the probability of real-world class membership for a considered group of objects. The real-world class membership value derived by LCA incorporates the influence of all types of uncertainty and reflects IQ.
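To make the role of LCA concrete, the following is a minimal sketch, not the author's implementation. It assumes a two-class model with one binary indicator per source (1 if the source reports the candidate object, 0 otherwise) fitted by expectation-maximization; the abstract does not specify the model variant, and the names `fit_lca`, `posterior`, and the sample data are hypothetical. For this particular variant, three binary indicators give 7 free cell probabilities against 7 model parameters, which is one way to see why fewer than three sources would not suffice.

```python
from typing import List, Sequence, Tuple

Pattern = Tuple[int, ...]  # one 0/1 flag per source (assumed encoding)
EPS = 1e-6                 # keeps probabilities away from exactly 0 or 1


def clamp(value: float) -> float:
    return min(max(value, EPS), 1.0 - EPS)


def posterior(x: Pattern, prior: float, p: Sequence[Sequence[float]]) -> float:
    """P(object belongs to the real-world class | response pattern x)."""
    like = []
    for c, weight in ((0, 1.0 - prior), (1, prior)):
        value = weight
        for j, xj in enumerate(x):
            # Sources are assumed independent given the latent class.
            value *= p[c][j] if xj else 1.0 - p[c][j]
        like.append(value)
    return like[1] / (like[0] + like[1])


def fit_lca(data: List[Pattern], n_iter: int = 200) -> Tuple[float, List[List[float]]]:
    """EM fit; returns P(class) and p[c][j] = P(source j reports | class c)."""
    n_sources = len(data[0])
    prior = 0.5
    # Asymmetric start so that class 1 becomes the "real-world class" label.
    p = [[0.3] * n_sources, [0.7] * n_sources]
    for _ in range(n_iter):
        # E-step: membership probability of every observed pattern.
        resp = [posterior(x, prior, p) for x in data]
        in_class = max(sum(resp), EPS)
        out_class = max(len(data) - sum(resp), EPS)
        # M-step: re-estimate the class prior and per-source report rates.
        prior = clamp(sum(resp) / len(data))
        for j in range(n_sources):
            hits_in = sum(r * x[j] for r, x in zip(resp, data))
            hits_out = sum((1.0 - r) * x[j] for r, x in zip(resp, data))
            p[1][j] = clamp(hits_in / in_class)
            p[0][j] = clamp(hits_out / out_class)
    return prior, p


# Hypothetical observations: which of three sources report each candidate object.
data = ([(1, 1, 1)] * 40 + [(0, 0, 0)] * 40 + [(1, 1, 0)] * 5
        + [(0, 0, 1)] * 5 + [(1, 0, 1)] * 5 + [(0, 1, 0)] * 5)
prior, p = fit_lca(data)
print(posterior((1, 1, 0), prior, p))  # membership for a partially confirmed object
```

In this sketch the value returned by `posterior` plays the part of the real-world class membership described above; a production analysis would more likely rely on an established statistics package than on a hand-written EM loop.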
By applying LCA to each triplet of corresponding classes at the lowest schema level and obtaining the real-world class membership, we can calculate the support of the real-world class for any level of the database, including the database itself, as a weighted average of the real-world class membership over all classes at the lowest level. The proposed approach does not solve the common problems of integrating heterogeneous data sources, but rather can be used for evaluating and improving IQ. The capability to evaluate IQ gives an important tool to users concerned with the trustworthiness of the data. It helps them to answer the
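The roll-up from the lowest schema level to higher levels can be sketched as follows. This is an illustration under stated assumptions: the abstract does not say which weights are used, so the sketch assumes each lowest-level class carries a weight such as its number of objects, and the names `support` and `weighted_sum` and the tree encoding are hypothetical.

```python
from typing import List, Tuple, Union

# A leaf is (membership, weight); an inner schema node is a list of children.
Leaf = Tuple[float, float]
Node = Union[Leaf, List["Node"]]


def weighted_sum(node: Node) -> Tuple[float, float]:
    """Return (sum of membership * weight, sum of weight) over all leaves."""
    if isinstance(node, tuple):
        membership, weight = node
        return membership * weight, weight
    total, weight_total = 0.0, 0.0
    for child in node:
        child_total, child_weight = weighted_sum(child)
        total += child_total
        weight_total += child_weight
    return total, weight_total


def support(node: Node) -> float:
    """Weighted average of lowest-level memberships beneath this node."""
    total, weight_total = weighted_sum(node)
    if weight_total == 0:
        raise ValueError("node has no weighted leaves")
    return total / weight_total


# Hypothetical schema: two branches; weights are assumed object counts.
branch_a = [(0.9, 10.0), (0.5, 30.0)]
branch_b = [(1.0, 10.0)]
database = [branch_a, branch_b]
print(support(branch_a), support(database))
```

Because every level is computed from the same leaves, the value for the whole database is consistent with the values of its branches, whatever weighting scheme is eventually chosen.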


Similar resources

Uncertain Data Integration Using Functional Dependencies

Data integration systems are crucial for applications that need to provide a uniform interface to a set of autonomous and heterogeneous data sources. However, setting up a full data integration system for many application contexts, e.g. web and scientific data management, requires significant human effort which prevents it from being really scalable. In this paper, we propose IFD (Integration b...


An Uncertain Data Integration System

Data integration systems offer uniform access to a set of autonomous and heterogeneous data sources. An important task in setting up a data integration system is to match the attributes of the source schemas. In this paper, we propose a data integration system which uses the knowledge implied within functional dependencies for matching the source schemas. We build our system on a probabilistic ...


Ontology Based Data Integration with User Feedback

Many applications need to access multiple heterogeneous data sources. The integration of these data sources raises several semantic heterogeneity problems. In existing systems, it is difficult to provide the high quality result to the end users due to the heterogeneity, inaccuracy of the facts in data sources. This paper is proposed to resolve semantic heterogeneity problem in integration by us...


Conceiving a Multiscale Dataspace for Data Analysis

A consequence of the intensive growth of information shared online is the increase of opportunities to link and integrate distinct sources of knowledge. This linking and integration can be hampered by different levels of heterogeneity in the available sources. Existing approaches focusing on heavyweight integration – e.g., schema mapping or ontology alignment – require costly upfront efforts to...


The Integration of Hetrogeneous Data Sources for Quality Based Dynamic Source Integration System

Data integration is a challenging domain for combining data from different sources. This provides users a unified view of data. Data integration system constitutes a major importance in current real time application and characterized by some issues related from speculative conclusion of view. Integration of multiple heterogeneous data sources into real time application is a time consuming and cos...




Publication date: 2004